Use XGrammar for structured output generation #118
base: main
Conversation
Pull request overview
Updates the on-device structured JSON generation path to use XGrammar-based token masking, replacing the prior schema-constrained generator flow in the MLX and Llama model implementations.
Changes:
- Add `swift-xgrammar` dependency and integrate `XGrammar` into the package target.
- Implement XGrammar-driven structured JSON generation loops for `MLXLanguageModel` and `LlamaLanguageModel`.
- Add new structured-generation error cases (`schemaEncodingFailed`, `grammarMismatch`) in both models.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Sources/AnyLanguageModel/Models/MLXLanguageModel.swift | Switch structured JSON generation to XGrammar matcher + token-bitmask sampling. |
| Sources/AnyLanguageModel/Models/LlamaLanguageModel.swift | Switch structured JSON generation to XGrammar matcher + token-bitmask sampling; make generation async. |
| Package.swift | Add swift-xgrammar dependency and link XGrammar product to the main target. |
| Package.resolved | Update resolved pins to include swift-xgrammar (and other pin changes). |
```swift
let token = try backend.sample(using: bitmask, applyMask: needsMask)
if backend.endTokens.contains(token) {
    break
}
guard matcher.accept(Int32(token)) else {
    throw LlamaLanguageModelError.grammarMismatch
}
```
Copilot AI · Feb 9, 2026
The generation loop breaks immediately when a sampled token is in endTokens, before checking/advancing the grammar matcher. If the model samples EOS/EOT before the grammar has fully terminated (or in a step where masking is not applied), this will return incomplete/invalid JSON. Consider only stopping on end tokens when matcher.isTerminated is already true (or accept the token into the matcher and verify termination) so early EOS doesn’t truncate the structured output.
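A minimal sketch of that ordering, using only the names already visible in this diff plus the `isTerminated` state mentioned above (treat its exact spelling as an assumption about the matcher API):

```swift
let token = try backend.sample(using: bitmask, applyMask: needsMask)
if backend.endTokens.contains(token) {
    // Only honor EOS/EOT once the grammar has fully terminated; otherwise the
    // structured output would be truncated mid-JSON.
    guard matcher.isTerminated else {
        throw LlamaLanguageModelError.grammarMismatch
    }
    break
}
guard matcher.accept(Int32(token)) else {
    throw LlamaLanguageModelError.grammarMismatch
}
```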
```swift
private func tokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    guard vocabSize > 0 else {
        throw LlamaLanguageModelError.contextInitializationFailed
    }

    var encodedVocab: [String] = []
    encodedVocab.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize {
        let token = llama_token(tokenId)
        encodedVocab.append(tokenToText(vocab: vocab, token: token) ?? "")
    }

    let stopTokenIDs = stopTokens.map { Int32($0) }
    return try TokenizerInfo(
        encodedVocab: encodedVocab,
        encoding: .byteFallback,
        stopTokenIDs: stopTokenIDs,
        addPrefixSpace: false
    )
}
```
Copilot AI · Feb 9, 2026
Building TokenizerInfo reconstructs the entire encodedVocab (size = vocabSize) for every structured-generation request. This is expensive and deterministic for a given model/vocab; consider caching the computed TokenizerInfo (keyed by vocab pointer + vocabSize, or by model ID) to avoid repeatedly traversing the entire vocabulary.
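A rough sketch of such a cache, assuming one `TokenizerInfo` per loaded model is enough; the property names and the single-entry policy are hypothetical, and real code would need to consider thread safety:

```swift
// Hypothetical single-entry cache: the vocab is fixed for a loaded model, so the
// expensive full-vocab traversal only needs to run once.
private var tokenizerInfoCache: TokenizerInfo?
private var tokenizerInfoCacheVocabSize: Int = 0

private func cachedTokenizerInfo(
    for vocab: OpaquePointer,
    vocabSize: Int,
    stopTokens: Set<Int>
) throws -> TokenizerInfo {
    if let cached = tokenizerInfoCache, tokenizerInfoCacheVocabSize == vocabSize {
        return cached
    }
    let info = try tokenizerInfo(for: vocab, vocabSize: vocabSize, stopTokens: stopTokens)
    tokenizerInfoCache = info
    tokenizerInfoCacheVocabSize = vocabSize
    return info
}
```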
```swift
.product(name: "XGrammar", package: "swift-xgrammar"),
.product(
```
Copilot AI · Feb 9, 2026
XGrammar is only imported/used inside the MLX and Llama conditional compilation blocks, but the target depends on it unconditionally. This forces consumers to fetch/build swift-xgrammar even when neither trait is enabled, and can break traitless builds if that dependency has platform/toolchain constraints. Consider making the XGrammar product dependency conditional on the MLX and/or Llama traits (similar to MLXLLM / LlamaSwift).
Suggested change:

```diff
-.product(name: "XGrammar", package: "swift-xgrammar"),
-.product(
+.product(
+    name: "XGrammar",
+    package: "swift-xgrammar",
+    condition: .when(traits: ["MLX", "Llama"])
+),
+.product(
```
```diff
 let backend = try MLXTokenBackend(
     context: context,
     input: lmInput,
     parameters: generateParameters,
     maximumTokens: maxTokens,
     endTokens: []
 )

-var generator = try ConstrainedJSONGenerator(backend: backend, schema: schema)
-let json = try generator.generate()
+let jsonSchema = try jsonSchemaString(for: schema)
+let grammar = Grammar(jsonSchema: jsonSchema, formatting: .compact, strictMode: true)
+let tokenizerInfo = try tokenizerInfo(
+    for: context.tokenizer,
+    vocabSize: backend.vocabSize,
+    stopTokens: backend.endTokens
+)
+let matcher = try await grammar.matcher(
+    for: tokenizerInfo,
+    stopTokens: backend.endTokens.map { Int32($0) },
+    terminatesWithoutStopToken: true
+)
```
Copilot AI · Feb 9, 2026
Structured generation constructs MLXTokenBackend with endTokens: [], which disables the backend’s normal EOS/EOT end-token detection. With the new XGrammar path, stopTokens and the loop’s early-stop check both rely on backend.endTokens, so this ends up passing an empty stop-token set to the matcher and prevents clean termination via EOS. Consider passing endTokens: nil here (use the backend’s default end tokens) and plumbing those into the matcher / termination logic so generation can stop cleanly when the model emits EOS after the grammar is satisfied.
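A sketch of that change, mirroring the diff above; treating `endTokens: nil` as "fall back to the model's default end tokens" is an assumption about `MLXTokenBackend`'s initializer:

```swift
let backend = try MLXTokenBackend(
    context: context,
    input: lmInput,
    parameters: generateParameters,
    maximumTokens: maxTokens,
    endTokens: nil  // assumption: nil means "use the model's default EOS/EOT tokens"
)

// The matcher then receives a non-empty stop-token set, so generation can end
// cleanly when the model emits EOS after the grammar is satisfied.
let matcher = try await grammar.matcher(
    for: tokenizerInfo,
    stopTokens: backend.endTokens.map { Int32($0) },
    terminatesWithoutStopToken: true
)
```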
```swift
if applyMask {
    var allowedIndices: [UInt32] = []
    allowedIndices.reserveCapacity(vocabSize)
    for tokenId in 0 ..< vocabSize where bitmask.isTokenAllowed(tokenId) {
        allowedIndices.append(UInt32(tokenId))
    }
    guard !allowedIndices.isEmpty else {
        throw MLXLanguageModelError.grammarMismatch
    }
    let allowedArray = MLXArray(allowedIndices)
    let maskedLogits = full(logits.shape, values: -Float.infinity)
    maskedLogits[0..., allowedArray] = logits[0..., allowedArray]
    let sampledToken = sampler.sample(logits: maskedLogits)
    return sampledToken.item(Int.self)
}
```
Copilot AI · Feb 9, 2026
The masked sampling path is O(vocabSize) per generated token (scans every token ID, builds an indices array, and allocates a full maskedLogits tensor each step). For typical vocab sizes (50k–200k) this can be a major bottleneck for structured generation. Consider reusing buffers across steps and/or applying the mask more directly (e.g., materializing only the mask once per step without reserving vocabSize, or using a bitmask→indices iterator if XGrammar exposes one) to avoid repeated full-vocab scans and allocations.
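One possible shape of this, as a hedged sketch: keep a single vocab-sized buffer for the whole request and apply the mask additively, instead of building an index list and cloning the logits through fancy indexing on every step. The only bitmask API assumed is what the diff already shows (`isTokenAllowed(_:)`); a bitmask-to-indices iterator, if XGrammar exposes one, would avoid the per-step scan entirely.

```swift
// Allocated once per structured-generation request, reused every step.
var maskBuffer = [Float](repeating: 0, count: vocabSize)

// Inside the per-token sampling step:
if applyMask {
    var anyAllowed = false
    for tokenId in 0 ..< vocabSize {
        let allowed = bitmask.isTokenAllowed(tokenId)
        maskBuffer[tokenId] = allowed ? 0 : -.infinity
        anyAllowed = anyAllowed || allowed
    }
    guard anyAllowed else {
        throw MLXLanguageModelError.grammarMismatch
    }
    // Additive mask: disallowed logits become -inf without an index gather or a
    // fresh full-vocab tensor allocation each step.
    let maskedLogits = logits + MLXArray(maskBuffer)
    return sampler.sample(logits: maskedLogits).item(Int.self)
}
```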
Follow-up to #106